In this section we investigate the regional characteristics of the complaints in more detail.

Visual Analysis

We first use data visualization to give an overview of how felonies are distributed across boroughs.

Borough

complaint %>% 
  filter(level == "FELONY") %>% 
  drop_na(borough) %>% 
  group_by(year, borough) %>% 
  dplyr::summarize(n_obs = n()) %>% 
  ggplot(aes(x = reorder(borough, -n_obs), y = n_obs, fill = reorder(borough, -n_obs))) +
  geom_bar(stat = 'identity') +
  labs(
    title = "Frequency of Felonies by Borough (2016-2022)",
    x = "Borough",
    y = "Frequency"
  ) +
  theme(legend.position = "none")

According to the plot, Brooklyn had the most felonies, followed by Manhattan. The Bronx and Queens had about the same number of felonies, and Staten Island had the fewest.

complaint %>% 
  filter(level == "FELONY") %>% 
  drop_na(borough) %>% 
  group_by(year, borough) %>% 
  dplyr::summarize(n_obs = n(), .groups = "drop_last") %>% 
  # the data are still grouped by year here, so sum(n_obs) is the yearly total
  mutate(percentage = n_obs / sum(n_obs)) %>% 
  ggplot(aes(x = year, y = percentage, fill = borough)) +
  geom_bar(stat = 'identity') +
  labs(
    x = "Year",
    y = "Proportion",
    title = "Proportions of Felonies by Borough and Year",
    fill = "Borough"
  )

The proportion of felonies in each borough does not appear to have changed noticeably over the years.
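
To make the within-year normalisation behind this plot concrete, here is a minimal base R sketch on made-up counts. The `counts` matrix and its numbers are hypothetical and are not taken from the complaint data. Each column is divided by its own total, so the shares within a year sum to 1.

```r
# Hypothetical felony counts (NOT from the complaint data):
# rows are boroughs, columns are years. matrix() fills column by column.
counts <- matrix(
  c(300, 500, 200,
    330, 550, 220),
  nrow = 3,
  dimnames = list(
    borough = c("BRONX", "BROOKLYN", "QUEENS"),
    year = c("2016", "2017")
  )
)

# Divide each column (year) by its total to get within-year proportions.
shares <- prop.table(counts, margin = 2)

# Largest change in any borough's share between the two years.
max_shift <- max(abs(shares[, "2017"] - shares[, "2016"]))

print(shares)
print(max_shift)
```

In this toy case every count grows by 10%, so the shares stay the same even though the totals rise. A stacked proportion plot is meant to show exactly this kind of stability or shift in shares, independent of the overall volume.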

Statistical Testing

We use statistical tests to determine whether monthly complaint counts and felony rates differ between boroughs.

Test average number of cases per month

anova_table =
  complaint %>% 
  filter(borough %in% c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")) %>% 
  mutate(monthly = str_c(as.character(year), as.character(month))) %>% 
  group_by(borough, monthly) %>%
  dplyr::summarise(
    n_obs = n(),
    n_felony = sum(level == "FELONY"),
    felony_rate = n_felony / n_obs) 

anova_table %>% 
  group_by(borough) %>% 
  dplyr::summarise(
    monthly_cases = mean(n_obs)
  )
## # A tibble: 5 × 2
##   borough       monthly_cases
##   <fct>                 <dbl>
## 1 BRONX                 8316.
## 2 BROOKLYN             11060.
## 3 MANHATTAN             9423.
## 4 QUEENS                7843.
## 5 STATEN ISLAND         1656.
res_monthly_case = aov(n_obs ~ factor(borough), data = anova_table)
summary(res_monthly_case)
##                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## factor(borough)   4 4.146e+09 1.036e+09    1345 <2e-16 ***
## Residuals       400 3.083e+08 7.708e+05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pairwise.t.test(anova_table$n_obs, anova_table$borough, p.adj = 'bonferroni')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  anova_table$n_obs and anova_table$borough 
## 
##               BRONX   BROOKLYN MANHATTAN QUEENS 
## BROOKLYN      < 2e-16 -        -         -      
## MANHATTAN     1.2e-13 < 2e-16  -         -      
## QUEENS        0.0067  < 2e-16  < 2e-16   -      
## STATEN ISLAND < 2e-16 < 2e-16  < 2e-16   < 2e-16
## 
## P value adjustment method: bonferroni

The ANOVA shows that mean monthly cases are not equal across the five boroughs, and the Bonferroni-adjusted pairwise t tests further show that every borough differs from every other at the 0.05 level. Ranked by mean monthly cases from highest to lowest, the order is Brooklyn, Manhattan, Bronx, Queens, Staten Island.
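
For readers unfamiliar with the adjustment used above: with five boroughs there are choose(5, 2) = 10 pairwise comparisons, and the Bonferroni method multiplies each raw p-value by that count, capping the result at 1. The sketch below illustrates this with hypothetical raw p-values. The vector `raw_p` is invented for illustration and is not output from the analysis.

```r
# Number of pairwise comparisons among five groups.
n_comparisons <- choose(5, 2)

# Hypothetical raw (unadjusted) p-values, for illustration only.
raw_p <- c(0.0001, 0.004, 0.03, 0.2)

# Bonferroni adjustment via base R, telling p.adjust the full family size.
adjusted <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

# The same adjustment by hand: multiply by the family size and cap at 1.
manual <- pmin(1, raw_p * n_comparisons)

print(data.frame(raw_p, adjusted, manual))
```

Because the adjustment is conservative, a pair that remains below 0.05 after it (such as Bronx versus Queens at 0.0067) had a much smaller raw p-value.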

Test felony rate

anova_table %>% 
  group_by(borough) %>% 
  dplyr::summarise(
    mean_felony_rate = mean(felony_rate)
  )
## # A tibble: 5 × 2
##   borough       mean_felony_rate
##   <fct>                    <dbl>
## 1 BRONX                    0.296
## 2 BROOKLYN                 0.330
## 3 MANHATTAN                0.319
## 4 QUEENS                   0.324
## 5 STATEN ISLAND            0.247
res_felony_rate = aov(felony_rate ~ factor(borough), data = anova_table)
summary(res_felony_rate)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## factor(borough)   4 0.3743 0.09358   260.4 <2e-16 ***
## Residuals       400 0.1438 0.00036                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pairwise.t.test(anova_table$felony_rate, anova_table$borough, p.adj = 'bonferroni')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  anova_table$felony_rate and anova_table$borough 
## 
##               BRONX   BROOKLYN MANHATTAN QUEENS 
## BROOKLYN      < 2e-16 -        -         -      
## MANHATTAN     3.6e-12 0.0017   -         -      
## QUEENS        < 2e-16 0.3439   0.9615    -      
## STATEN ISLAND < 2e-16 < 2e-16  < 2e-16   < 2e-16
## 
## P value adjustment method: bonferroni

The ANOVA shows that the five boroughs do not have equal felony rates. The pairwise t tests further show that the Bronx and Staten Island each differ from all other boroughs and that Manhattan differs from Brooklyn, while Queens does not differ significantly from either Manhattan or Brooklyn. Ranked by mean felony rate from highest to lowest, the order is Brooklyn, Queens, Manhattan, Bronx, Staten Island.

Partial conclusion

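The tests above show that a high number of monthly cases does not necessarily mean a high felony rate: Queens ranks fourth in monthly cases but second in felony rate, whereas Manhattan ranks second in monthly cases but third in felony rate. The extremes do coincide, however. Brooklyn has both the highest monthly cases and the highest felony rate, and Staten Island has both the lowest monthly cases and the lowest felony rate.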
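
As a quick check on this reading, the borough means printed in the two summary tables above can be ranked side by side. The sketch below copies those rounded means by hand, so the values are approximate. The Spearman rank correlation is a descriptive summary added here for illustration and is not part of the original analysis.

```r
# Borough means copied by hand from the printed summary tables (rounded).
boroughs <- c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")
monthly_cases <- c(8316, 11060, 9423, 7843, 1656)
mean_felony_rate <- c(0.296, 0.330, 0.319, 0.324, 0.247)

# Rank 1 = highest value on each measure.
ranking <- data.frame(
  borough = boroughs,
  rank_cases = rank(-monthly_cases),
  rank_rate = rank(-mean_felony_rate)
)

# Spearman rank correlation between the two measures across boroughs.
rho <- cor(monthly_cases, mean_felony_rate, method = "spearman")

print(ranking)
print(rho)
```

The rankings agree at the extremes (Brooklyn first, Staten Island last), but Queens moves from fourth in volume to second in felony rate, so the rank correlation is positive (about 0.7) but short of 1. With only five boroughs this is a descriptive summary, not a formal test.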

Plotly 2022 map based on borough

NYPD_plot = 
  complaint %>% 
  # keep only 2022 before building the hover labels
  filter(year == 2022) %>% 
  mutate(
    text_label = str_c("Borough: ", borough, "\nPrecinct: ", precinct, "\nLevel: ",  level, "\nOffense: ", offense)) %>% 
  plot_ly(
    lat = ~latitude,
    lon = ~longitude,
    type = "scattermapbox",
    mode = "markers",
    alpha = 0.2,
    color = ~ borough,
    text = ~text_label) %>%
  layout(
    mapbox = list(
      style = "carto-positron",
      zoom = 9,
      center = list(lon = -73.9, lat = 40.7)
    )
  )
  
NYPD_plot